NanostringPPGVignette

Introduction

This vignette demonstrates the skin PPG workflow. The workflow starts from the .dcc files, the .pkc file, and the annotation file and works through preprocessing and exploratory analysis. The code folder in the main directory of this repository contains the .R files corresponding to each analysis step. This pipeline is a work in progress and will gain additional analysis steps (for example, deconvolution). Below is a demonstration of the current steps of the pipeline.

[Pipeline overview diagram]

In the following sections of this vignette, we will walk through each substep of the analysis to thoroughly explain what the code is doing. Currently, the pipeline includes more functionality than a given analysis may need, and individual substeps can be turned off and on as necessary.

Before we begin, we need to load the packages required for this workflow.

Code
################################################################################
## The purpose of this script is to load and install the required packages 
## For the workflow
################################################################################

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# The following initializes most up to date version of Bioc
#BiocManager::install(version="3.15")

#BiocManager::install("NanoStringNCTools")
#BiocManager::install("GeomxTools")
#BiocManager::install("GeoMxWorkflows")

# For preprocessing
library(NanoStringNCTools)
library(GeomxTools)
library(GeoMxWorkflows)
library(knitr)
library(ggplot2)
library(ggforce)
library(dplyr)
library(scales) # for percent
library(reshape2)  
library(cowplot) 


# For batch correction
library(sva)

# For dim reduction
library(umap)
library(Rtsne)
library(pheatmap)  # for pheatmap

# For modeling
library(geepack)

# Source every helper script in ../R (source() takes one file at a time)
for (f in list.files("../R", full.names = TRUE)) source(f)

QC and Normalization

The quality control and normalization section starts with the raw data files (.dcc, pkc, and annotation files) and works through the following substeps:

  • Data loading

  • Quality control

    • Segment QC (Default = ON)

    • Probe QC (Default = ON)

    • Limit of quantification (LOQ) (Default = OFF)

  • Normalization

    • Quantile 3 (Q3) normalization

    • Background normalization

    • Transcripts per million (TPM) (default = ON)

  • Log transformation (Default = ON, base=2)

  • Batch correction (Default = ON)

We demonstrate the order and structure of these steps in the diagram below. Please note that the arrows in this diagram do not represent dependencies; they show the order in which the data are processed. For example, we do not need to run segment QC in order to run probe QC. However, if both segment QC and probe QC are turned “ON”, segment QC is run first.

QC and Normalization Workflow

To see detailed information about segment QC, probe QC, LOQ, and batch correction, click the drop-downs below.

Every ROI/AOI segment is tested for:

  • Raw sequencing reads: segments with fewer than 1,000 raw reads are removed.

  • % Aligned, % Trimmed, or % Stitched sequencing reads: segments below ~80% for one or more of these QC parameters are removed.

  • % Sequencing saturation ([1-deduplicated reads/aligned reads]%): segments below ~50% require additional sequencing to capture full sample diversity and are not typically analyzed until improved.

  • Negative Count: this is the geometric mean of the several unique negative probes in the GeoMx panel that do not target mRNA and establish the background count level per segment; segments with low negative counts (1-10) are not necessarily removed but may be studied closer for low endogenous gene signal and/or insufficient tissue sampling.

  • No Template Control (NTC) count: values >1,000 could indicate contamination for the segments associated with this NTC; however, in cases where the NTC count is between 1,000- 10,000, the segments may be used if the NTC data is uniformly low (e.g. 0-2 counts for all probes).

  • Nuclei: >100 nuclei per segment is generally recommended; however, this cutoff is highly study/tissue dependent and may need to be reduced; what is most important is consistency in the nuclei distribution for segments within the study.

  • Area: generally correlates with nuclei; a strict cutoff is not generally applied based on area.
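The negative-count metric above relies on the geometric mean of the negative probes. As a minimal base-R sketch (the actual pipeline uses ngeoMean() from GeomxTools, which also guards against zeros; the counts below are made up and assume a prior shift to ≥ 1):

```r
# Geometric mean: exponential of the mean of the logs.
geo_mean <- function(x) exp(mean(log(x)))

neg_counts <- c(1, 2, 4, 8)   # hypothetical negative-probe counts for one segment
geo_mean(neg_counts)          # 2.828..., i.e. 2^1.5
```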

Default Values Of Segment QC Metrics

  Metric                                       Default value
  Aligned, trimmed, and stitched percentage    80 %
  Sequencing saturation                        50 %
  Minimum negative count                       1
  Maximum no-template-control (NTC) count      9000
  Minimum nuclei                               20
  Minimum area                                 1000

We will remove low-performing probes before summarizing our data into gene-level count data. In short, this QC is an outlier removal process whereby we remove probes in one of two ways.

  • Entirely from the study (global)

  • From specific segments (local).

A probe is removed globally from the dataset if either of the following is true:

  • The geometric mean of that probe’s counts from all samples divided by the geometric mean of all probe counts representing the target from all samples is less than 0.1 (this only affects the negative control probes).

  • The probe is an outlier according to the Grubbs test in at least 20% of the segments.

We remove a probe locally (from a given segment) if the probe is an outlier according to the Grubbs test in that segment.

The NanoString GitHub repository shows that the Grubbs test is performed on the log base ten scale for all probe values within a sample. Below is the Grubbs test procedure they follow.

Grubbs test:

The Grubbs test checks whether there exists at least one outlier in data that are approximately normally distributed.

H0: There are no outliers in the data. H1: There is exactly one outlier in the data.

The Grubbs test is defined as:

Let \(Y_{ij}\) be the \(i^{th}\) probe in the \(j^{th}\) sample. The Grubbs test statistic for the \(j^{th}\) sample is:

\[ G_j = \frac{\max_i |Y_{ij} - \bar{Y}_j|}{s_j} \] where \(\bar{Y}_j\) is the sample mean and \(s_j\) is the sample standard deviation of the \(j^{th}\) sample. We reject H0 if:

\[ G_j > \frac{N-1}{\sqrt{N}}\sqrt{\frac{t^2_{\alpha/2N, N-2}}{N-2 + t^2_{\alpha/2N, N-2} }} \]

From their code, they run the Grubbs test on each sample and return the name of the probe with the maximum deviation whenever the test rejects the null hypothesis.
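Putting the definitions above together, a hedged base-R sketch of this per-sample procedure might look as follows (illustrative only; the probe names and counts are hypothetical, and the exact GeomxTools implementation may differ):

```r
# Sketch of the per-sample Grubbs test described above, on the log10 scale.
grubbs_flag <- function(counts, alpha = 0.05) {
  y <- log10(counts)
  n <- length(y)
  dev <- abs(y - mean(y))
  G <- max(dev) / sd(y)                       # test statistic G_j
  # Critical value from the formula above, using the t quantile at alpha/(2N)
  t2 <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)^2
  G_crit <- ((n - 1) / sqrt(n)) * sqrt(t2 / (n - 2 + t2))
  if (G > G_crit) names(which.max(dev)) else NA_character_
}

# Hypothetical probe counts for one target in one segment; p5 is extreme
probes <- c(p1 = 10, p2 = 12, p3 = 9, p4 = 11, p5 = 1000)
grubbs_flag(probes)   # "p5"
```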

In addition to segment and probe QC, we also determine the limit of quantification (LOQ) per segment. The LOQ is calculated from the distribution of negative control probes and is intended to approximate the quantifiable limit of gene expression per segment. Please note that this process is more stable in larger segments. Likewise, the LOQ may not accurately reflect true signal detection rates in segments with low negative probe counts (e.g., < 2). The formula for calculating the LOQ in the \(i^{th}\) segment is:

\[LOQ_{i} = geomean(NegProbe_{i}) * geoSD(NegProbe_{i})^{n}\]

We typically use 2 geometric standard deviations (\(n = 2\)) above the geometric mean as the LOQ, which is reasonable for most studies. We also recommend that a minimum LOQ of 2 be used if the LOQ calculated in a segment is below this threshold.
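As a minimal sketch, the formula above can be written in base R as follows (the counts are hypothetical; in the pipeline these quantities come from the NegGeoMean and NegGeoSD columns computed during segment QC):

```r
# Per-segment LOQ: geometric mean * geometric SD^n of the negative probes,
# floored at a minimum LOQ of 2. neg holds (already shifted) negative counts.
loq <- function(neg, n = 2, min_loq = 2) {
  geo_mean <- exp(mean(log(neg)))   # geometric mean of negative probes
  geo_sd   <- exp(sd(log(neg)))     # geometric standard deviation
  max(min_loq, geo_mean * geo_sd^n)
}

loq(c(2, 4, 8))   # geomean = 4, geoSD = 2, so LOQ = 4 * 2^2 = 16
```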

Filtering using LOQ

After determining the limit of quantification (LOQ) per segment, NanoString recommends filtering out segments and/or genes with abnormally low signal.

NanoString determines the number of genes detected in each segment across the dataset. We plot the number of segments against the percentage of genes that fall above the LOQ threshold, allowing us to see the impact of filtering. Currently, the code filters out all segments with less than 1% of genes above the LOQ threshold. Additionally, any gene not above the LOQ threshold in at least 10% of segments is filtered out.

ComBat allows users to adjust for batch effects in datasets where the batch covariate is known, using the methodology described in Johnson et al. 2007. It uses parametric or non-parametric empirical Bayes frameworks to adjust the data for batch effects; by default, the parametric adjustment is used. The method returns an expression matrix with the batch effects corrected. The input data are assumed to be cleaned and normalized before batch effect removal.
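Stripped of the pipeline objects, the ComBat call used later in this workflow reduces to the following sketch on simulated data (the matrix, batch labels, and artificial shift below are all made up for illustration):

```r
library(sva)

# Simulated log-scale expression matrix (genes x segments) with two batches
set.seed(1)
expr <- matrix(rnorm(200), nrow = 20,
               dimnames = list(paste0("g", 1:20), paste0("s", 1:10)))
batch <- rep(c("run1", "run2"), each = 5)
expr[, batch == "run2"] <- expr[, batch == "run2"] + 3   # artificial batch shift

# Parametric empirical-Bayes adjustment (the default). mod = NULL means no
# biological covariates are preserved during the correction.
corrected <- ComBat(dat = expr, batch = batch, mod = NULL, prior.plots = FALSE)
```

After correction, the per-batch column means of `corrected` are approximately equal, whereas in `expr` they differ by the injected shift.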

All the places you need to edit the code are at the top of the script. You will need to edit

  • Which QC parts to use

    • Segment QC (SegmentQC)

    • Probe QC (ProbeQC)

    • Limit of quantification (LOQ)

  • Normalization method to use

    • The options are

      • Quantile normalization (q_norm)

      • Background normalization (b_norm)

      • Transcripts per million (TPM)

  • Log transformation (log_transformation)

    • If you set log_transformation = TRUE, then you must provide a base to use.

  • Batch effect (batchEff)

Once you have identified the parts to include in the analysis, you will need to provide the variable names in the annotation file that contain the following information:

  • ROI

  • AOI

  • Batch (only include if batchEff=TRUE)

  • DCC names (DCC_col)

    • This is a variable that links the metadata to the .dcc files.

Finally, provide paths to the folders that contain the .dcc files, the .pkc file, and the annotation Excel file.

Code
###############################################################################
###  This script is used for the following steps
###  1. Data loading (.DCC files, PKC files and sample annotation)
##   2. Quality Control
##      - SegmentQC (default = ON)
##      - ProbeQC (default = ON)
##      - Limit of Quantification QC (LOQ) (default = OFF)
##   3. Normalization
##      - Q3 Normalization
##      - Background Normalization
##      - Transcripts per million (default = ON)
##
##  4. Log transformation (default = ON, base=2)
##
##  5. Batch correction (default = ON)
###############################################################################

# Which QC metrics would you like to use?
SegmentQC = T
ProbeQC = T
LOQ = F


# Which normalization would you like to use? 
# Options are q_norm (Q3 normalization), b_norm (background normalization),
# or TPM (transcripts per million)

norm_method = "TPM"


# Do you want to use a log transformation? If so what base do you want to use?
log_transformation = T

base = 2



# Do you want to include batch effect adjustment?
batchEff = T



# What are the ROI and AOI names in the annotation file

ROI = "region"

AOI = "SegmentLabel"

# Batch is only required if batchEff=T
batch = "Experiment"


DCC_col = "Sample_ID"

###############################################################################
### Step 1: 
### DCC files (folder), PKC Files (folder)
### and Sample annotation (folder)
###############################################################################


# Provide path to the DCC files, PKC file and Sample annotation
DCC = "../Data/Raw/DCC"

PKC = "../Data/Raw/PKC"

SampleAnnotation = "../Data/Raw/Annotation"



######################################
#### Get all paths for the raw data
######################################

# DCC
DCCFiles <- dir(DCC, pattern = ".dcc$",
                full.names = TRUE, recursive = TRUE)


# PKC
PKCFiles <- dir(PKC, pattern = ".pkc$", full.names = TRUE, recursive = TRUE)


# Sample annotation (Alternatively you could provide a direct path)
SampleAnnotationFile <- dir(SampleAnnotation, pattern = ".xlsx$",
      full.names = TRUE, recursive = TRUE)

#########################################
### Load all data into the nanostring 
### object to store all data together.
#########################################
# 1. load data (This may take a few moments)
demoData <- readNanoStringGeoMxSet(dccFiles = DCCFiles,
                                   pkcFiles = PKCFiles,
                                   phenoDataFile = SampleAnnotationFile,
                                   phenoDataSheet = "Sheet1",
                                   phenoDataDccColName = DCC_col
)



################################################################################
### Step 2: Quality Control
###   Consists of the following subparts
###     - Count shift
###     - Segment QC
###     - Probe QC
###     - LOQ (limit of quantification) filtering
###############################################################################

##############################################################
# Shift counts to one
#############################################################
demoData <- shiftCountsOne(demoData, useDALogic = TRUE)



############################################################
## Segment QC
############################################################
if(SegmentQC==T){
    # Set the QC parameters that we wish to use. These can be edited.  
    QC_params <-
      list(minSegmentReads = 1000, # Minimum number of reads (1000)
           percentTrimmed = 80,    # Minimum % of reads trimmed (80%)
           percentStitched = 80,   # Minimum % of reads stitched (80%)
           percentAligned = 80,    # Minimum % of reads aligned (80%)
           percentSaturation = 50, # Minimum sequencing saturation (50%)
           minNegativeCount = 1,   # Minimum negative control counts (1)
           maxNTCCount = 9000,     # Maximum counts observed in NTC well (9000)
           minNuclei = 20,         # Minimum # of nuclei estimated (20)
           minArea = 1000)         # Minimum segment area (1000)
    
    
    # Include the QC parameters in the nanostring object 
    demoData <- setSegmentQCFlags(demoData, 
                        qcCutoffs = QC_params)   
    
    ##############################################
    ###  Create QC summary table based on
    ###  these flags.
    ##############################################
    
    QCResults <- protocolData(demoData)[["QCFlags"]]
    flag_columns <- colnames(QCResults)
    
    
    QC_Summary <- data.frame(Pass = colSums(!QCResults[, flag_columns]),
                             Warning = colSums(QCResults[, flag_columns]))
    QCResults$QCStatus <- apply(QCResults, 1L, function(x) {
      ifelse(sum(x) == 0L, "PASS", "WARNING")
    })
    
    
    # Display summary of the results
    # Optionally add a total-flags row to the summary before printing:
    # QC_Summary["TOTAL FLAGS", ] <-
    #   c(sum(QCResults[, "QCStatus"] == "PASS"),
    #     sum(QCResults[, "QCStatus"] == "WARNING"))
    
    
    ################################################
    ### Visualize segment QC
    ################################################
    
    col_by <- AOI
    
    # Trimmed % 
   print( QC_histogram(sData(demoData), "Trimmed (%)", col_by, QC_params$percentTrimmed))
    
    # Stitched %
    print(QC_histogram(sData(demoData), "Stitched (%)", col_by, QC_params$percentStitched))
    
    # Aligned %
    print(QC_histogram(sData(demoData), "Aligned (%)", col_by, QC_params$percentAligned))
    
    # Saturated %
   print(QC_histogram(sData(demoData), "Saturated (%)", col_by, QC_params$percentSaturation) +
      labs(title = "Sequencing Saturation (%)",
           x = "Sequencing Saturation (%)"))
    
    # Area
    print(QC_histogram(sData(demoData), "AOISurfaceArea", col_by, QC_params$minArea,
                 scale_trans = "log10"))
    
    # Nuclei
    print(QC_histogram(sData(demoData), "AOINucleiCount", col_by, QC_params$minNuclei))
    
    
    #################################################
    ## Calculate and plot geometric mean of the 
    ## Negative probes for each segment.
    #################################################
    
    # gene annotations modules
    #  Assign the pkcs variable to the name of the gene annotation file 
    pkcs <- annotation(demoData)
    
    
    #  Create table with gene annotation file names
    modules <- gsub(".pkc", "", pkcs)
    
    
    #  Calculate the geometric means of the negative controls for each of the samples.
    negativeGeoMeans <- 
      esBy(negativeControlSubset(demoData), 
           GROUP = "Module", 
           FUN = function(x) { 
             assayDataApply(x, MARGIN = 2, FUN = ngeoMean, elt = "exprs") 
           }) 
    
    
    # Store the geometric means in the protocol (metadata)
    protocolData(demoData)[["NegGeoMean"]] <- negativeGeoMeans
    
    
    #  Explicitly copy the Negative geoMeans from sData to pData
    negCols <- paste0("NegGeoMean_", modules)
    pData(demoData)[, negCols] <- sData(demoData)[["NegGeoMean"]]
    
    
    #  Plot the geoMetric means for each sample.  
    for(ann in negCols) {
      plt <- QC_histogram(pData(demoData), ann, col_by, QC_params$minNegativeCount, scale_trans = "log10")
      print(plt)
    }
    
    
    
    ###############################
    ## Display Segment QC results
    ## and filter segments
    ###############################
    
    # Segment QC Summary
    print(kable(QC_Summary, caption = "QC Summary Table for each Segment"))
    
    # Filter only segments that passed 
    # Subset data only on the samples that passed the QC Metrics. 
    demoData <- demoData[, QCResults$QCStatus == "PASS"]
    
    
}   

QC Summary Table for each Segment

                  Pass   Warning
  LowReads         136         0
  LowTrimmed       136         0
  LowStitched      136         0
  LowAligned       136         0
  LowSaturation    136         0
  LowNegatives     136         0
Code
####################################################################
###  Probe QC
####################################################################

if(ProbeQC == T){

      # Set flag values
    demoData <- setBioProbeQCFlags(demoData, 
                                   qcCutoffs = list(minProbeRatio = 0.1,
                                                    percentFailGrubbs = 20), 
                                   removeLocalOutliers = TRUE)
    
    
    # Data frame of raised flags for each probe.
    ProbeQCResults <- fData(demoData)[["QCFlags"]]
    
    
    # Summary table of flag counts.
    qc_df <- data.frame(Passed = sum(rowSums(ProbeQCResults[, -1]) == 0),
                        Global = sum(ProbeQCResults$GlobalGrubbsOutlier),
                        Local = sum(rowSums(ProbeQCResults[, -2:-1]) > 0
                                    & !ProbeQCResults$GlobalGrubbsOutlier))
    
    print(kable(qc_df, caption = "Probes flagged or passed as outliers"))
    
    
    ##################################
    ### Filter Probes
    ##################################
    
    #Subset object to exclude all that did not pass Ratio & Global testing
    ProbeQCPassed <- 
      subset(demoData, 
             fData(demoData)[["QCFlags"]][,c("LowProbeRatio")] == FALSE &
               fData(demoData)[["QCFlags"]][,c("GlobalGrubbsOutlier")] == FALSE)
    
    # Reassign to demoData
    demoData <- ProbeQCPassed 
    
}    
Probes flagged or passed as outliers

  Passed   Global   Local
   18796        0      19
Code
##################################
## Create gene level count matrix
##################################


# Aggregate counts with the geometric mean. 
target_demoData <- aggregateCounts(demoData)



################################################################
###  Limit of Quantification (LOQ)
################################################################
if(LOQ==TRUE){
    ################################
    ## Calculate LOQ
    ###############################
    # Define the LOQ cutoff and the minLOQ.
    cutoff <- 2
    minLOQ <- 2

    # Module names (recomputed here so this block does not depend on Segment QC)
    modules <- gsub(".pkc", "", annotation(target_demoData))
    
    
    
    # Calculate LOQ per module tested
    LOQ <- data.frame(row.names = colnames(target_demoData))
    for(module in modules) {
      vars <- paste0(c("NegGeoMean_", "NegGeoSD_"),
                     module)
      if(all(vars[1:2] %in% colnames(pData(target_demoData)))) {
        LOQ[, module] <-
          pmax(minLOQ,
               pData(target_demoData)[, vars[1]] * 
                 pData(target_demoData)[, vars[2]] ^ cutoff)
      }
    }
    
    
    # Store the LOQ information.
    pData(target_demoData)$LOQ <- LOQ
    
    
    #####################################
    ### Create matrix with LOQ
    ### Pass/Fail information
    #####################################
    
    
    
    #  Initialize an empty vector to store the LOQ_Mat values.
    LOQ_Mat <- c()
    
    # Run for loop.
    for(module in modules) {
      ind <- fData(target_demoData)$Module == module
      
      # 2. Check if each value is greater than the corresponding LOQ value.
      Mat_i <- t(esApply(target_demoData[ind, ], MARGIN = 1,
                         FUN = function(x) {
                           x > LOQ[, module]
                         }))
      LOQ_Mat <- rbind(LOQ_Mat, Mat_i)
    }
    
    
    # 3. Ensure ordering since this is stored outside of the geomxSet
    LOQ_Mat <- LOQ_Mat[fData(target_demoData)$TargetName, ]
    
    
    #####################################
    ## Segment gene detection rate plot
    #####################################
    
    # 1. Save detection rate information to pheno data
    pData(target_demoData)$GenesDetected <- 
      colSums(LOQ_Mat, na.rm = TRUE)
    pData(target_demoData)$GeneDetectionRate <-
      pData(target_demoData)$GenesDetected / nrow(target_demoData)
    
    
    # 2. Classify each sample based on the percentage of genes above the LOQ. 
    pData(target_demoData)$DetectionThreshold <- 
      cut(pData(target_demoData)$GeneDetectionRate,
          breaks = c(0, 0.01, 0.05, 0.1, 0.15, 1),
          labels = c("<1%", "1-5%", "5-10%", "10-15%", ">15%"))
    
    
    # 3. Create a barplot of the different classes. (1%, 5%, 10%, 15%)
   
    geneplot <- ggplot(pData(target_demoData),
           aes(x = DetectionThreshold)) +
      geom_bar(aes(fill = get(ROI))) +
      geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
      theme_bw() +
      scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
      labs(x = "Gene Detection Rate",
           y = "Segments, #",
           fill = "Segment Type", title = "Number of Samples in Gene Detection Rate Samples")
    
    print(geneplot)
    
    #######################################
    ### Filter segments based on 
    ### gene detection rate
    #######################################
    
    target_demoData <- target_demoData[, pData(target_demoData)$GeneDetectionRate >= .01]
    
    
    #######################################
    ## Gene detection rate plot
    #######################################
    
    # Filter LOQ matrix
    LOQ_Mat <- LOQ_Mat[, colnames(target_demoData)]
    
    
    # Calculate the number of segments in which each gene is detected.
    fData(target_demoData)$DetectedSegments <- rowSums(LOQ_Mat, na.rm = TRUE)
    
    
    # Save the detection rate for each gene.
    fData(target_demoData)$DetectionRate <-
      fData(target_demoData)$DetectedSegments / nrow(pData(target_demoData))
    
    
    
    # Create a data frame for plotting with a frequency column.
    plot_detect <- data.frame(Freq = c(1, 5, 10, 20, 30, 50))
    
    
    # Calculate number of genes above each threshold.
    plot_detect$Number <-
      unlist(lapply(c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5),
                    function(x) {sum(fData(target_demoData)$DetectionRate >= x)}))
    
    
    # Calculate the percentage of genes above this threshold. 
    plot_detect$Rate <- plot_detect$Number / nrow(fData(target_demoData))
    rownames(plot_detect) <- plot_detect$Freq
    
    
    # Plot gene detection rate
    print( ggplot(plot_detect, aes(x = as.factor(Freq), y = Rate, fill = Rate)) +
      geom_bar(stat = "identity") +
      geom_text(aes(label = formatC(Number, format = "d", big.mark = ",")),
                vjust = 1.6, color = "black", size = 4) +
      scale_fill_gradient2(low = "orange2", mid = "lightblue",
                           high = "dodgerblue3", midpoint = 0.65,
                           limits = c(0,1),
                           labels = scales::percent) +
      theme_bw() +
      scale_y_continuous(labels = scales::percent, limits = c(0,1),
                         expand = expansion(mult = c(0, 0))) +
      labs(x = "% of Segments",
           y = "Genes Detected, % of Panel > LOQ")
    
    )
    # Save plot
    #ggsave("../Outputs/Plots/LOQGeneDetectionRatePlot.png")
    
    
    #########################################
    ### Filter genes based on gene detection
    ### Rate
    ##########################################
    #  Subset to target genes detected in at least 10% of the samples.
    negativeProbefData <- subset(fData(target_demoData), CodeClass == "Negative")
    neg_probes <- unique(negativeProbefData$TargetName)
    
    
    # Filter out genes detected in less than 10% of segments (keeping negative probes).
    target_demoData <- 
      target_demoData[fData(target_demoData)$DetectionRate >= 0.1 |
                        fData(target_demoData)$TargetName %in% neg_probes, ]
    
}    


################################################################################
### Step 3: Normalization
###   There are three normalization options
###     - Q3 normalization
###     - Background Normalization
###     - Transcripts per million (default)
################################################################################

##################################
### Q3 Normalization
##################################

if(norm_method == "q_norm"){

  # Q3 norm (75th percentile) for WTA/CTA  with or without custom spike-ins
  target_demoData <- normalize(target_demoData ,
                               norm_method = "quant", 
                               desiredQuantile = .75,
                               toElt = norm_method)

}

###################################
## Background normalization
###################################

if(norm_method == "b_norm"){

  # Background normalization for WTA/CTA without custom spike-in
  target_demoData <- normalize(target_demoData ,
                               norm_method = "neg", 
                               fromElt = "exprs",
                               toElt = norm_method)


}


#############################################
## Transcripts per-million standardization
#############################################

if(norm_method== "TPM"){
  
  
  TPM_norm = apply(target_demoData@assayData$exprs, 2,
                   FUN = function(x){ (x/sum(x))*1000000 })
  
  
  # Add assay data
  assayDataElement(target_demoData,"TPM") = TPM_norm
  
  
}



##############################################
### Log transformation
##############################################

if(log_transformation ==T) {
  
  
  # log transform assay data
  assayLogT = log(target_demoData@assayData[[norm_method]], base = base)
  
  
  # Add assay data to the 
  assayDataElement(target_demoData,paste0("logT-base",base)) = assayLogT
  
  
}





################################################################################
## Batch Effects (here the batch info is captured in the "Experiment" variable)
## There are two cases because, if we used a log transformation, then
## we want to use the log-transformed assay slot.
################################################################################
if(batchEff==T){
  
  # If we didn't log transform, then we want to use the norm_method slot.
  if(log_transformation==F){
  
      # Do batch correction
      adjusted_expression <- ComBat(dat = target_demoData@assayData[[norm_method]],
                                    batch = pData(target_demoData)[[batch]],
                                    mod = NULL, prior.plots = FALSE)
      
      # Save batch corrected results in nanostring object
      assayDataElement(target_demoData,"batch_corrected") = adjusted_expression
  }
  
  
  # If log transformed, we want to use the log-transformed assay slot
  if(log_transformation==T){
    
    # Do batch correction
    adjusted_expression <- ComBat(dat = target_demoData@assayData[[paste0("logT-base",base)]],
                                  batch = pData(target_demoData)[[batch]],
                                  mod = NULL, prior.plots = FALSE)
    
    # Save batch corrected results in nanostring object
    assayDataElement(target_demoData,"batch_corrected") = adjusted_expression
  }
  
  
}



##################################
## Save Data
##################################
save(target_demoData,file = "../Data/Processed/NormalizedData.Rdata") 

Exploratory analysis

The exploratory analysis allows us to visualize our data to see high-level patterns without requiring a formal hypothesis. Additionally, the exploratory analysis can serve as a quality control tool. For example, when we do dimensionality reduction, we may notice that most of the variance in the data is coming from the different batches, and we may want to go back to preprocessing and do batch correction before continuing with the downstream analysis. In this exploratory analysis, we are using:

  • Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction.

  • Heatmaps

Uniform manifold approximation and projection (UMAP) is a non-linear dimensionality reduction method rooted in topology. The goal of UMAP is to find a low-dimensional representation that best preserves the local structure of the high-dimensional space. The original method was published in 2018 and is a staple in scRNA-seq pipelines. The mathematical details of UMAP are complex; however, the authors provide a high-level introduction on their website. Becht et al. show the advantages of UMAP over traditional methods like PCA and t-SNE for visualization of scRNA-seq data.

First, you must “turn off/on” the UMAP and heatmap sections. By default, these are turned on; to turn them off, set UMAP = FALSE and/or Heatmap = FALSE. Then, you will need to provide a path to the data object, which is the output of the QC and normalization step. Since it is possible to have multiple expression matrices after QC and normalization, you need to specify which one you want to use. You can list the available expression matrices with assayDataElementNames(target_demoData). By default, we use the batch-corrected data (elt = "batch_corrected"). Finally, for the exploratory analysis, you must provide the names of the ROI, AOI, slide name, and batch variables. If you are not using batch correction, you can set batch = NULL.

Code
###############################################################################
## Exploratory Analysis
##    This consists of two parts
##      - Dimensionality reduction
##      - Heatmaps
################################################################################

## Which exploratory analysis section would you like to include?
UMAP = TRUE

Heatmap = TRUE

# What is the path to the processed data?
data_path = "../Data/Processed/NormalizedData.Rdata"

# Which assay to use? To see the options, use: assayDataElementNames(target_demoData)
elt = "batch_corrected"

# Provide the variable names for ROI, AOI, and slide names
ROI = "region"

AOI = "SegmentLabel"

batch = "Experiment"

slideName = "SlideName"

######################################
## Load Processed Data
######################################

# You may need to edit this code to update path
load(data_path) 



######################################
## Sample Distribution
######################################


knitr::kable(target_demoData@phenoData@data %>%
               select(all_of(c(slideName, AOI, ROI))) %>%
               group_by(across(all_of(c(slideName, ROI, AOI)))) %>%
               summarise(N = n()) %>%
               dcast(formula(paste0(slideName, "~", ROI, "+", AOI))),
             caption = "Distribution of Samples for each slide across ROI and AOI") %>%
  kableExtra::kable_paper()
`summarise()` has grouped output by 'SlideName', 'region'. You can override
using the `.groups` argument.
Using N as value column: use value.var to override.
Distribution of Samples for each slide across ROI and AOI

  SlideName                 AK Center_panckNeg  AK Center_panckPos  AK Edge_panckNeg  AK Edge_panckPos  Sun Protected_panckNeg  Sun Protected_panckPos
  007                                        3                   3                 2                 2                       5                       5
  015                                        3                   3                 3                 3                       5                       5
  080                                        3                   3                 3                 3                       5                       6
  086                                        3                   3                 3                 3                       5                       6
  SAZT20088-1 SATZ20088-3                    3                   3                 3                 3                       6                       6
  SAZT20099-1 SAZT20099-3                    3                   3                 3                 3                       6                       6
Code
######################################
## UMAP 
######################################

if(UMAP == TRUE){
    # update defaults for umap to contain a stable random_state (seed)
    custom_umap <- umap::umap.defaults 
    custom_umap$random_state <- 42 
    
    
    # run UMAP
    umap_out <-
      umap(t(assayDataElement(target_demoData , elt = elt)), 
           config = custom_umap)  # Uses default umap settings
    
    
    # Save the results from the UMAP in the data object
    pData(target_demoData)[, c("UMAP1", "UMAP2")] <- umap_out$layout[, c(1,2)] 
    
    if(is.null(batch)){
      
    # Plot using ggplot
    ggplot(pData(target_demoData), 
           aes(x = UMAP1, y = UMAP2, color = !!as.name(ROI))) + 
      geom_point(size = 3) + 
      theme_bw() + ggtitle("Dim Reduction Using UMAP.")
      
    } else {
      # Plot using ggplot
      ggplot(pData(target_demoData), 
             aes(x = UMAP1, y = UMAP2, color = !!as.name(ROI), shape = !!as.name(batch))) + 
        geom_point(size = 3) + 
        theme_bw() + ggtitle("Dim Reduction Using UMAP.")
      
      
    }
}
Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
Also defined by 'spam'

Code
######################################
## Heatmap
######################################

if(Heatmap == TRUE){

    # Extract the batch-corrected (already log-transformed) expression matrix
    mat <- target_demoData@assayData[[elt]]
    
    
    # Rank genes by variance and keep the top 50
    var_genes <- apply(mat,1,var)
    
    top50Genes <- var_genes[order(var_genes, decreasing = T)[1:50]]
    
    
    annot <- target_demoData@phenoData@data %>%
      select(all_of(c(ROI,AOI,slideName))) %>%
      arrange(!!as.name(slideName), !!as.name(AOI), !!as.name(ROI)) 
    
    
    
    
    pheatmap(mat[names(top50Genes),rownames(annot)], cluster_rows = F, cluster_cols = F,
             annotation_col = annot, show_rownames = F, show_colnames = F, 
             main = "Top 50 most variable genes (log2-transformed expression)", border_color = "black" )
}

Trajectory analysis

The goal of this analysis is to find genes that have either an upward or a downward trajectory from normal (Sun Protected) skin to AK Edge to AK Center. In this example, we will assign numeric values to the Normal, AK Edge, and AK Center regions of interest.

| ROI | Numeric Value |
|---|---|
| Normal | 1 |
| AK Edge | 2 |
| AK Center | 3 |
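As a minimal base-R sketch of this coding, the mapping can be written as a named lookup vector (the pipeline itself uses `dplyr::case_when()` in the modeling step, where Normal ROIs are labeled "Sun Protected" in the metadata; the example labels below are illustrative):

```r
# Named lookup vector mapping each ROI label to its trajectory value.
# (Base-R stand-in for the case_when() call used later in this script.)
region_to_value <- c("Sun Protected" = 1, "AK Edge" = 2, "AK Center" = 3)

# Example ROI labels as they might appear in the metadata
regions <- c("AK Center", "Sun Protected", "AK Edge", "AK Center")

regionT <- unname(region_to_value[regions])
regionT
# [1] 3 1 2 3
```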

In this example, the analysis will be stratified by AOI (CK+ and CK-). We will model each gene using generalized estimating equations (GEE) with the subject as the grouping variable. GEEs require us to specify a working correlation structure; here we use the exchangeable working correlation. We give an example of this modeling strategy for the A2M gene.

GEE modeling is applied to each gene, and we will focus on genes that show a significant trajectory from normal to AK Center. We summarize the number of significant genes, the proportion of significant genes, and the number of genes with upward and downward trajectories.
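The counting logic behind those summary tables can be sketched in base R on toy numbers (the slopes and p-values below are made up purely for illustration):

```r
# Toy per-gene slopes and p-values (illustrative values only)
slopes  <- c(0.42, -0.18, 0.05, -0.33)
pvalues <- c(0.01,  0.03, 0.40,  0.002)

sig <- pvalues < 0.05                # which genes have a significant trajectory

c(n_sig    = sum(sig),               # number of significant genes
  prop_sig = mean(sig),              # proportion significant
  up       = sum(sig & slopes > 0),  # significant upward trajectories
  down     = sum(sig & slopes < 0))  # significant downward trajectories
# n_sig = 3, prop_sig = 0.75, up = 1, down = 2
```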

We then show heatmaps focusing on the genes with upward/downward trajectories for each AOI. The values used in the heatmaps are standardized (z-scored, i.e., normalized to a standard normal distribution) to allow us to see all the trajectories on the same scale. From these heatmaps we may identify a few genes to examine more closely; for each such gene we plot the estimated trajectory together with its 95% confidence interval.
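The standardization behind those heatmaps is an ordinary per-gene z-score; a minimal sketch on simulated data using base R's `scale()` (the same function the heatmap code calls):

```r
# Simulated expression: 4 samples (rows) x 5 genes (columns)
set.seed(1)
expr <- matrix(rnorm(20, mean = 5), nrow = 4,
               dimnames = list(paste0("sample", 1:4), paste0("gene", 1:5)))

# Center each gene (column) to mean 0 and scale to sd 1, so genes with
# very different baseline expression share one color scale in the heatmap
z <- scale(expr, center = TRUE, scale = TRUE)

round(colMeans(z), 10)  # all 0 after centering
apply(z, 2, sd)         # all 1 after scaling
```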

Code
#############################################################################
## Trajectory analysis. Here we are testing for a trajectory from no disease to 
## Disease. 
#############################################################################


# What is the path to the processed data?
data_path = "../Data/Processed/NormalizedData.Rdata"

# Which assay to use? To see options, use: target_demoData@assayData$
# Here we are using the batch-corrected, log-transformed data
elt = "batch_corrected"

# Provide the variable names for ROI, AOI, and slide names
ROI = "region"

AOI = "SegmentLabel"

slideName = "SlideName"

subject_id = "SlideName"



######################################
## Load Processed Data
######################################

# You may need to edit this code to update path
load(data_path) 



######################################
## Sampling distributions
######################################
target_demoData@phenoData@data %>%
               select(all_of(c(slideName, AOI, ROI))) %>%
               group_by(across(all_of(c(slideName, ROI, AOI))))%>%
               summarise(N=n()) %>%
               dcast(formula(paste0(slideName,"~",ROI,"+",AOI))) %>%
              knitr::kable(col.names = c("SubjectID",
                                          "AK Center (CK-)",
                                         "AK Center (CK+)",
                                         "AK Edge (CK-)",
                                         "AK Edge (CK+)", 
                                         "Sun Protected (CK-)",
                                         "Sun Protected (CK+)"),caption = "Sampling distribution across subjects")
`summarise()` has grouped output by 'SlideName', 'region'. You can override
using the `.groups` argument.
Using N as value column: use value.var to override.
Sampling distribution across subjects

| SubjectID | AK Center (CK-) | AK Center (CK+) | AK Edge (CK-) | AK Edge (CK+) | Sun Protected (CK-) | Sun Protected (CK+) |
|---|---|---|---|---|---|---|
| 007 | 3 | 3 | 2 | 2 | 5 | 5 |
| 015 | 3 | 3 | 3 | 3 | 5 | 5 |
| 080 | 3 | 3 | 3 | 3 | 5 | 6 |
| 086 | 3 | 3 | 3 | 3 | 5 | 6 |
| SAZT20088-1 SATZ20088-3 | 3 | 3 | 3 | 3 | 6 | 6 |
| SAZT20099-1 SAZT20099-3 | 3 | 3 | 3 | 3 | 6 | 6 |
Code
##############################################
## Create model data so gene expression
## and metadata are together in one dataset
##############################################

# Get the assay data for the modeling
# (batch-corrected, log-scale)
assay <- data.frame(t(assayDataElement(target_demoData, elt = elt)))


# Here we merge the metadata and the assay data into one dataframe and add the 
# trajectory to the regions. 
model_dat <- merge(target_demoData@phenoData@data, assay, by="row.names") %>%
  mutate(
    regionT =  case_when(region == "Sun Protected" ~1,
                         region=="AK Edge"~2,
                         region == "AK Center"~ 3), 
    
    subject_ID = as.factor(!!as.name(subject_id))
  )


# Split the data into CKPlus and CKMinus 
model_datPlus <- model_dat %>%
  filter(SegmentLabel=="panckPos")


model_datMinus <- model_dat %>%
  filter(SegmentLabel=="panckNeg")



# Run trajectory analysis on one gene
geeCKPlus <- geeglm(A2M~  regionT, family=gaussian, id=subject_ID, data = model_datPlus,
                            corstr = "exchangeable")

geeCKMinus <- geeglm(A2M~  regionT, family=gaussian, id=subject_ID, data = model_datMinus,
                              corstr = "exchangeable") 



summary(geeCKPlus)

Call:
geeglm(formula = A2M ~ regionT, family = gaussian, data = model_datPlus, 
    id = subject_ID, corstr = "exchangeable")

 Coefficients:
            Estimate  Std.err    Wald Pr(>|W|)    
(Intercept)  5.20168  0.23936 472.279   <2e-16 ***
regionT     -0.05838  0.08078   0.522     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)   0.4347  0.1097
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha  0.06825 0.04383
Number of clusters:   6  Maximum cluster size: 12 
Code
summary(geeCKMinus)

Call:
geeglm(formula = A2M ~ regionT, family = gaussian, data = model_datMinus, 
    id = subject_ID, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err     Wald Pr(>|W|)    
(Intercept)   5.6208  0.0530 11241.72   <2e-16 ***
regionT       0.0320  0.0512     0.39     0.53    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    0.179  0.0185
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha -0.00285  0.0569
Number of clusters:   6  Maximum cluster size: 12 
Code
# Create gee function to loop through all genes
trajectories <- function(dat,genes){
  # Create function for modeling each gene
      geeF <- function(X){
        
        geeM <-  geeglm(X~  regionT, family=gaussian, id=subject_ID, data =dat,
                        corstr = "exchangeable")
        
        # Get summary of results
        sum_res <- summary(geeM)
        
        # Output beta coefficient and p-value
        c("Trajectory"= geeM$coefficients[2],
          se = sum_res$coefficients[2,2],
          "pvalue"= sum_res$coefficients[2,4],
        "intercept" = sum_res$coefficients[1,1]) 
      }
      
      # Order data based on subject ID
      dat = dat %>%
        arrange(subject_ID)
      
      # Run gene trajectories on the data
      traj <- apply(dat[,genes], MARGIN = 2, FUN = geeF)
      
      # Return the result
      return(t(traj))    
        
}


#################################################################
## Run trajectory analysis
################################################################

# Identify the genes we want to run it on. 
genes <-colnames(assay)


# CK+ trajectories (only the first 1000 genes are modeled here)
trajectories_plus <- trajectories(model_datPlus,genes = genes[1:1000])


# CK- trajectories (only the first 1000 genes are modeled here)
trajectories_minus <-trajectories(model_datMinus,genes = genes[1:1000])




############################################################################
## Tables to summarise results
###########################################################################

# CK + Trajectories
as.data.frame(trajectories_plus) %>%
  mutate(
    # How many had a significant trajectory?
    sig = case_when(pvalue<0.05~"sig",
                    pvalue >= 0.05 ~ "NotSig"), 
    
    # Upward or downward trajectory?
    traject = case_when(
      Trajectory.regionT < 0  & sig == "sig" ~ "Downward",
      Trajectory.regionT > 0  & sig == "sig" ~ "Upward")
  ) %>%
  summarise(SigTrajectory = sum(sig=="sig"), sigProp = SigTrajectory /n(),
            Upward = sum(traject =="Upward", na.rm = T),
            Downward = sum(traject =="Downward", na.rm = T)) %>%
  knitr::kable(col.names = c("Significant Genes", "Proportion Significant","Upward", "Downward"), 
               caption = "Significant trajectories (CK+)")
Significant trajectories (CK+)

| Significant Genes | Proportion Significant | Upward | Downward |
|---|---|---|---|
| 257 | 0.257 | 79 | 178 |
Code
# Ck - trajectories
as.data.frame(trajectories_minus) %>%
  mutate(
    # How many had a significant trajectory?
    sig = case_when(pvalue<0.05~"sig",
                    pvalue >= 0.05 ~ "NotSig"), 
    
    # Upward or downward trajectory?
    traject = case_when(
      Trajectory.regionT < 0  & sig == "sig" ~ "Downward",
      Trajectory.regionT > 0  & sig == "sig" ~ "Upward")
  ) %>%
  summarise(SigTrajectory = sum(sig=="sig"), sigProp = SigTrajectory /n(),
            Upward = sum(traject =="Upward", na.rm = T),
            Downward = sum(traject =="Downward", na.rm = T)) %>%
  knitr::kable(col.names = c("Significant Genes", "Proportion Significant","Upward", "Downward"), 
               caption = "Significant trajectories (CK-)")
Significant trajectories (CK-)

| Significant Genes | Proportion Significant | Upward | Downward |
|---|---|---|---|
| 250 | 0.25 | 136 | 114 |
Code
###############################################################################
## Trajectory Heatmaps
##############################################################################

##### CK plus heatmap ########################################

## Find all significant genes with upward trajectory
upgenes <- as.data.frame(trajectories_plus) %>%
  filter(pvalue < 0.05 & Trajectory.regionT>0) %>%
  arrange(-Trajectory.regionT) %>%
  rownames()

downgenes <-  as.data.frame(trajectories_plus) %>%
  filter(pvalue < 0.05 & Trajectory.regionT< 0) %>%
  arrange(Trajectory.regionT) %>%
  rownames()


# Create matrix for heatmaps
UpMat <- model_datPlus %>%
  arrange(regionT) 

downMat <-  model_datPlus %>%
  arrange(regionT) 


# Heatmaps for upward and downward genes
pheatmap(t(scale(UpMat[,upgenes[1:50]],T,T)), cluster_rows = F, cluster_cols = F,
         labels_col = UpMat$region, main = "Normalized Log-scaled CPM upward trajectories (CK-Plus)")

Code
pheatmap(t(scale(downMat[,downgenes[1:50]],T,T)), cluster_rows = F, cluster_cols = F,
         labels_col = downMat$region, main = "Normalized Log-scaled CPM downward trajectories (CK-Plus)")

Code
########## CK-Minus heatmap ################################################

## Find all significant genes with upward trajectory
upgenes <- as.data.frame(trajectories_minus) %>%
  filter(pvalue < 0.05 & Trajectory.regionT>0) %>%
  arrange(-Trajectory.regionT) %>%
  rownames()

downgenes <-  as.data.frame(trajectories_minus) %>%
  filter(pvalue < 0.05 & Trajectory.regionT< 0) %>%
  arrange(Trajectory.regionT) %>%
  rownames()


# Create matrix for heatmaps
UpMat <- model_datMinus %>%
  arrange(regionT) 

downMat <-  model_datMinus %>%
  arrange(regionT) 


# Heatmaps for upward and downward genes
pheatmap(t(scale(UpMat[,upgenes[1:50]],T,T)), cluster_rows = F, cluster_cols = F,
         labels_col = UpMat$region, main = "Normalized Log-scaled CPM upward trajectories (CK-Minus)")

Code
pheatmap(t(scale(downMat[,downgenes[1:50]],T,T)), cluster_rows = F, cluster_cols = F,
         labels_col = downMat$region, main = "Normalized Log-scaled CPM downward trajectories (CK-Minus)")

Code
#############################################################################
# Plot genes of interest
############################################################################

gene = "PGK1"


############## CK Plus trajectory #############################################
slopeG = trajectories_plus[gene,1]
interceptG = trajectories_plus[gene,4]

seG = trajectories_plus[gene,2]

# Define upper and lower bound
upperSlopeG = slopeG + 1.96 * seG 
lowerSlopeG = slopeG - 1.96 * seG 

ggplot(model_datPlus, aes(x=regionT,y=!!as.name(gene)))+
  geom_point() +
  geom_abline(slope = slopeG, intercept = interceptG) +
  geom_abline(slope = upperSlopeG, intercept = interceptG, linetype="dashed") + # Upper bound
  geom_abline(slope = lowerSlopeG, intercept = interceptG, linetype="dashed") +# Lower Bound
  ggtitle(paste0("Estimated Trajectory of ",gene," in CK-Plus"),"with 95% confidence interval of the trajectory") +
  theme_bw() +
  labs(caption = "Values are log-scaled CPM")

Code
############### CK Minus Trajectory ###########################################

slopeG = trajectories_minus[gene,1]
interceptG = trajectories_minus[gene,4]

seG = trajectories_minus[gene,2]

# Define upper and lower bound
upperSlopeG = slopeG + 1.96 * seG 
lowerSlopeG = slopeG - 1.96 * seG 

ggplot(model_datMinus, aes(x=regionT,y=!!as.name(gene)))+
  geom_point() +
  geom_abline(slope = slopeG, intercept = interceptG) +
  geom_abline(slope = upperSlopeG, intercept = interceptG, linetype="dashed") + # Upper bound
  geom_abline(slope = lowerSlopeG, intercept = interceptG, linetype="dashed") +# Lower Bound
  ggtitle(paste0("Estimated Trajectory of ",gene," in CK-Minus"),"with 95% confidence interval of the trajectory") +
  theme_bw()+
  labs(caption = "Values are log-scaled CPM")